Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fix possible panic when 'SetNode' is called #1685

Merged
merged 1 commit into from
Oct 21, 2021

Conversation

eggiter
Copy link
Member

@eggiter eggiter commented Aug 19, 2021

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 19, 2021
@eggiter
Copy link
Member Author

eggiter commented Aug 19, 2021

/assign @william-wang @k82cn @hzxuzhonghu

@@ -187,10 +187,14 @@ func (r *Resource) Add(rr *Resource) *Resource {
return r
}

//Sub subtracts two Resource objects.
// Sub subtracts two Resource objects with assertion.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this modification is clear, I think the original function is also acceptable.

@Thor-wl
Copy link
Contributor

Thor-wl commented Aug 23, 2021

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2021
@eggiter
Copy link
Member Author

eggiter commented Aug 27, 2021

@william-wang @k82cn @hzxuzhonghu
PTAL!

@k82cn
Copy link
Member

k82cn commented Aug 27, 2021

/hold

This commit fixes possible panic when SetNode is called, which may happen when gpus of one machine are lost during the restarting of nvidia-device-plugin;

Is there any panic stack info for this issue?

@volcano-sh-bot volcano-sh-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 27, 2021
}

// Dry run, make sure all fields other than `State` are in the original state.
copy := ni.Clone()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why clone?

Copy link
Member Author

@eggiter eggiter Aug 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make make sure all fields are left untouched and in the original state, as i commented.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point where do we mutate this node?

Copy link
Member Author

@eggiter eggiter Aug 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The original codes are all changing fields of nodeInfo.
  • Here, the clone i used, is merely a dry-run to make sure if all these changes makes this node not ready then we shall not to actually do this operation and just set state of this node to notready.

@eggiter
Copy link
Member Author

eggiter commented Aug 30, 2021

@k82cn

  • Stack traces are listing as follows;
  • Detailed re-produce procedures can be found in UT TestNodeInfo_SetNode;
panic: resource is not sufficient to do operation: <cpu 5000.00, memory 5000000000.00> sub <cpu 6000.00, memory 6000000000.00> [recovered]
	panic: resource is not sufficient to do operation: <cpu 5000.00, memory 5000000000.00> sub <cpu 6000.00, memory 6000000000.00>

goroutine 28 [running]:
testing.tRunner.func1.1(0x278c940, 0xc00007f290)
	/usr/local/go/src/testing/testing.go:1072 +0x552
testing.tRunner.func1(0xc000182600)
	/usr/local/go/src/testing/testing.go:1075 +0x328
panic(0x278c940, 0xc00007f290)
	/usr/local/go/src/runtime/panic.go:975 +0x4c6
volcano.sh/volcano/pkg/scheduler/util/assert.Assert(0x29e9a00, 0xc00044c200, 0x77)
	/workspace/go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:33 +0x205
volcano.sh/volcano/pkg/scheduler/util/assert.Assertf(0xc0004d6600, 0x29e9a8c, 0x39, 0xc0004e7898, 0x2, 0x2)
	/workspace/go/src/volcano.sh/volcano/pkg/scheduler/util/assert/assert.go:43 +0xa5
volcano.sh/volcano/pkg/scheduler/api.(*Resource).Sub(0xc0004d67a0, 0xc0004d6620, 0x0)
	/workspace/go/src/volcano.sh/volcano/pkg/scheduler/api/resource_info.go:192 +0x15e
volcano.sh/volcano/pkg/scheduler/api.(*NodeInfo).SetNode(0xc000318e70, 0xc0004ea300)
	/workspace/go/src/volcano.sh/volcano/pkg/scheduler/api/node_info.go:329 +0x5a5
volcano.sh/volcano/pkg/scheduler/api.TestNodeInfo_SetNode(0xc000182600)
	/workspace/go/src/volcano.sh/volcano/pkg/scheduler/api/node_info_test.go:222 +0xda5
testing.tRunner(0xc000182600, 0x2a50500)
	/usr/local/go/src/testing/testing.go:1123 +0x1a3
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:1168 +0x648

@eggiter eggiter requested a review from hzxuzhonghu August 31, 2021 02:50
@hzxuzhonghu
Copy link
Collaborator

@eggiter we need to fix func (r *Resource) Sub(rr *Resource) *Resource {

@hzxuzhonghu
Copy link
Collaborator

Amazing there is panic in the functiuon, should remove all the stuff

@k82cn
Copy link
Member

k82cn commented Aug 31, 2021

Detailed re-produce procedures can be found in UT TestNodeInfo_SetNode;

AFAIK, we'll ignore not ready Nodes; is there any e2e case to trigger the panic?

@eggiter
Copy link
Member Author

eggiter commented Aug 31, 2021

Detailed re-produce procedures can be found in UT TestNodeInfo_SetNode;

AFAIK, we'll ignore not ready Nodes; is there any e2e case to trigger the panic?

  • Yes, nodes which are not ready are already ignored.
  • But this situation here is:
    • the node is Ready but the allocatable gpu becomes '0'.
    • furthermore, this event triggers NodeUpdate event, which will call SetNode underneath and the panic occurred.

@eggiter
Copy link
Member Author

eggiter commented Aug 31, 2021

@eggiter we need to fix func (r *Resource) Sub(rr *Resource) *Resource {

can you guys leave this fix to me? and you can tell me the idea of how to resolve if this PR can not achieve this puerpose.

@eggiter
Copy link
Member Author

eggiter commented Sep 6, 2021

@k82cn @hzxuzhonghu what's the current status?

@k82cn
Copy link
Member

k82cn commented Sep 7, 2021

the node is Ready but the allocatable gpu becomes '0'.

We should ignore such nodes during scheduling cycle.

@eggiter
Copy link
Member Author

eggiter commented Sep 7, 2021

the node is Ready but the allocatable gpu becomes '0'.

We should ignore such nodes during scheduling cycle.

It will ignored by setting status of node to NotReady, which is done by this commit.

@Thor-wl
Copy link
Contributor

Thor-wl commented Sep 7, 2021

I think such nodes can act as CPU nodes and continue to work if device plugin doesn't work well.

@eggiter
Copy link
Member Author

eggiter commented Sep 7, 2021

I think such nodes can act as CPU nodes and continue to work if device plugin doesn't work well.

Sorry to tell you but, the machine can lost its memory occasionally. It'd be better to set status to NotReady and ignore in scheduling cycle.

@eggiter eggiter requested a review from Thor-wl September 8, 2021 07:09
Signed-off-by: lvhaodong <lvhaodong@kuaishou.com>
@eggiter eggiter force-pushed the fix-panic-when-setnode branch from bea046b to 83a58c3 Compare October 20, 2021 12:25
@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 20, 2021
@eggiter
Copy link
Member Author

eggiter commented Oct 21, 2021

@Thor-wl @k82cn PTAL!

@k82cn
Copy link
Member

k82cn commented Oct 21, 2021

/lgtm
/approve

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 21, 2021
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: k82cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 21, 2021
@Thor-wl
Copy link
Contributor

Thor-wl commented Oct 21, 2021

/lgtm
/approve

@Thor-wl Thor-wl removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 21, 2021
@volcano-sh-bot volcano-sh-bot merged commit 25d4f1f into volcano-sh:master Oct 21, 2021
@eggiter eggiter deleted the fix-panic-when-setnode branch October 21, 2021 06:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants